Store Sales - Forecasting by Linear Regression¶

We will go deeper into these section:

  • Interpolation for Oil Prices
  • Exploratory Data Analysis
  • Modelling

The objective is to predict sales using the test_data. This data is provided by Corporación Favorita, a major grocery retailer based in Ecuador. The company also has operations in several other South American countries.

Click here to learn about the company.


No description has been provided for this image
No description has been provided for this image

The dataset includes 54 stores and 33 product families. The time series spans from 2013-01-01 to 2017-08-31. Kaggle has divided the data into two parts: train and test. The test data covers a 15-day period following the end date of the training data. This date range in the test data is crucial for defining a cross-validation strategy and generating new features.

*The objective is to predict sales using the test_data.*

We will examine five datasets step by step:

  1. Train
  2. Test
  3. Store
  4. Transactions
  5. Daily Oil Price

Train data includes time series data for combinations of stores and product families. The sales column indicates the total sales for a product family at a particular store on a specific date. Fractional sales are possible due to the sale of items in fractional quantities (e.g., 1.5 kg of cheese vs. 1 bag of chips). The onpromotion column shows the total number of items in a product family being promoted at a store on a given date.

Store data provides information about each store, including city, state, type, and cluster.

Transaction data is highly correlated with the sales column in the train dataset, allowing us to understand sales patterns at the stores.

Daily Oil Price data is another valuable dataset. Given Ecuador's dependence on oil, its economic health is particularly sensitive to fluctuations in oil prices. This data will help us determine how oil prices impact different product families.

The "Additional Notes" section in the data description may reveal significant patterns or anomalies. Here are some notes to keep in mind:¶

  • Public sector wages are paid bi-weekly on the 15th and the last day of the month, which could influence supermarket sales.
  • A magnitude 7.8 earthquake struck Ecuador on April 16, 2016, prompting relief efforts that included donations of essential goods. This event significantly affected supermarket sales for several weeks following the earthquake.

Let's get started!

No description has been provided for this image

1. Importing Important Packages¶

In [39]:
import numpy as np
import pandas as pd
import os
import gc
import warnings
import statsmodels.api as sm      # If PACF - ACF will be used...
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Configurations
pd.set_option('display.max_columns', None)
pd.options.display.float_format = '{:.2f}'.format
warnings.filterwarnings('ignore')

2. Importing Several Dataset¶

In [4]:
# Import
train = pd.read_csv(r'C:\Users\Deni Rifqi\Downloads\train.csv')
test = pd.read_csv(r'C:\Users\Deni Rifqi\Downloads\test.csv')
stores = pd.read_csv(r'C:\Users\Deni Rifqi\Downloads\stores.csv')
#sub = pd.read_csv(r'C:\Users\Deni Rifqi\Downloads\sub.csv')
transactions = pd.read_csv(r'C:\Users\Deni Rifqi\Downloads\transactions.csv').sort_values(["store_nbr", "date"])

# Datetime
train["date"] = pd.to_datetime(train.date)
test["date"] = pd.to_datetime(test.date)
transactions["date"] = pd.to_datetime(transactions.date)

# Data types
train.onpromotion = train.onpromotion.astype("float16")
train.sales = train.sales.astype("float32")
stores.cluster = stores.cluster.astype("int8")

train.head()
Out[4]:
id date store_nbr family sales onpromotion
0 0 2013-01-01 1 AUTOMOTIVE 0.00 0.00
1 1 2013-01-01 1 BABY CARE 0.00 0.00
2 2 2013-01-01 1 BEAUTY 0.00 0.00
3 3 2013-01-01 1 BEVERAGES 0.00 0.00
4 4 2013-01-01 1 BOOKS 0.00 0.00

3. Transactions¶

Deep Delve Into Transactions Dataset

In [5]:
transactions.head(15)
Out[5]:
date store_nbr transactions
1 2013-01-02 1 2111
47 2013-01-03 1 1833
93 2013-01-04 1 1863
139 2013-01-05 1 1509
185 2013-01-06 1 520
231 2013-01-07 1 1807
277 2013-01-08 1 1869
323 2013-01-09 1 1910
369 2013-01-10 1 1679
415 2013-01-11 1 1813
461 2013-01-12 1 1473
507 2013-01-13 1 542
553 2013-01-14 1 1780
599 2013-01-15 1 1680
645 2013-01-16 1 1933

This feature shows a strong correlation with sales, but initially, we need to aggregate the sales data to identify the relationship. Transactions refer to the number of customers visiting the store or the number of invoices issued in a day.

The sales data represents the total revenue for a product family at a specific store on a given date. Since products can be sold in fractional quantities (e.g., 1.5 kg of cheese instead of just one bag of chips), fractional sales values are possible.

Therefore, transactions will be a key feature in the model. In the subsequent sections, we will create new features based on transaction data.

Transactions Heatmap¶

In [6]:
temp = pd.merge(train.groupby(["date", "store_nbr"]).sales.sum().reset_index(), transactions, how = "left")
print("Spearman Correlation between Total Sales and Transactions: {:,.5f}".format(temp.corr("spearman").sales.loc["transactions"]))
px.line(transactions.sort_values(["store_nbr", "date"]), x='date', y='transactions', color='store_nbr',title = "Transactions Heatmap" )
Spearman Correlation between Total Sales and Transactions: 0.81746

Transactions Boxplot¶

In [7]:
a = transactions.copy()
a["year"] = a.date.dt.year
a["month"] = a.date.dt.month
px.box(a, x="year", y="transactions" , color = "month", title = "Transactions Boxplot")

Transactions exhibit a consistent pattern, with all months showing similar trends except for December, as observed from 2013 to 2017 in the boxplot. Additionally, the previous plot revealed the same pattern across all stores, where sales consistently increased towards the end of the year.

Transactions by using monthly average sales!

We have identified a pattern that boosts sales, which occurs at the end of the year. Additionally, transactions tend to rise during the spring season and decline afterward.

In [8]:
import plotly.express as px

# Processing Data
a = transactions.set_index("date").resample("M").transactions.mean().reset_index()
a["year"] = a.date.dt.year

# Making Graphic
fig = px.line(
    a, 
    x='date', 
    y='transactions', 
    color='year', 
    title="Monthly Average Transactions",
    labels={'transactions': 'Average Transactions', 'date': 'Date'},
    line_shape='linear',  # Bentuk garis linear
    markers=True,         # Menampilkan marker di titik data
    template='plotly_white',  # Tema latar putih
    color_discrete_sequence=px.colors.qualitative.Set1,  # Skema warna diskrit
)

# Adding Layout and Design Fatures 
fig.update_layout(
    title={
        'text': "Monthly Average Transactions Over Time",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top',
    },
    xaxis_title='Date',
    yaxis_title='Average Transactions',
    legend_title_text='Year',
    font=dict(
        family="Arial, sans-serif",
        size=14,
        color="black"
    ),
    hovermode="x unified",
    margin=dict(l=50, r=50, t=80, b=50),
    plot_bgcolor='rgba(240,240,240,0.8)',  # Background plot berwarna abu-abu terang
    paper_bgcolor='white',  # Background kertas putih
)

# Adding Grid Line
fig.update_xaxes(showgrid=True, gridwidth=1, gridcolor='lightgray')
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='lightgray')

fig.show()

When we look at their relationship, we can see that there is a highly correlation between total sales and transactions also.

In [9]:
px.scatter(temp, x = "transactions", y = "sales", trendline = "ols", trendline_color_override = "red")

The days of the week play a crucial role in shopping behavior, revealing a significant pattern. Stores experience higher transaction volumes on weekends, with Saturday being the peak day for shopping. This trend remains consistent from 2013 to 2017.

In [10]:
import plotly.express as px

# Menyalin data dan menambahkan kolom tahun dan hari dalam minggu
a = transactions.copy()
a["year"] = a.date.dt.year
a["dayofweek"] = a.date.dt.dayofweek + 1  # Menambahkan 1 untuk membuat hari mulai dari 1 (Senin) hingga 7 (Minggu)

# Menghitung rata-rata transaksi per tahun dan hari dalam minggu
a = a.groupby(["year", "dayofweek"]).transactions.mean().reset_index()

# Membuat grafik garis
fig = px.line(
    a, 
    x="dayofweek", 
    y="transactions", 
    color="year",
    title="Average Transactions by Day of the Week",
    labels={
        "dayofweek": "Day of the Week", 
        "transactions": "Average Transactions",
        "year": "Year"
    },
    markers=True,  # Menambahkan marker pada titik data
    line_shape='linear',  # Bentuk garis linear
    template="plotly_white",  # Tema latar putih
    color_discrete_sequence=px.colors.qualitative.Plotly,  # Palet warna diskrit
)

# Menambahkan layout dan fitur desain
fig.update_layout(
    title={
        'text': "Average Transactions by Day of the Week",
        'y':0.95,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top',
    },
    xaxis_title='Day of the Week',
    yaxis_title='Average Transactions',
    font=dict(
        family="Arial, sans-serif",
        size=14,
        color="black"
    ),
    hovermode="x unified",
    margin=dict(l=50, r=50, t=80, b=50),
    plot_bgcolor='rgba(240,240,240,0.8)',  # Background plot berwarna abu-abu terang
    paper_bgcolor='white',  # Background kertas putih
)

# Mengatur sumbu agar grid lebih jelas
fig.update_xaxes(
    tickmode='array',
    tickvals=[1, 2, 3, 4, 5, 6, 7],
    ticktext=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
    showgrid=True, 
    gridwidth=1, 
    gridcolor='lightgray'
)
fig.update_yaxes(showgrid=True, gridwidth=1, gridcolor='lightgray')

fig.show()

4. Oil Price Impact¶

The economy poses a significant challenge for both governments and individuals, impacting various aspects in either positive or negative ways. In this context, Ecuador, being reliant on oil, will experience fluctuations in its model due to changes in oil prices. To gain a deeper understanding, I examined Ecuador's economic situation and discovered a relevant article from the IMF. Reading this article is recommended if you aim to enhance your models using oil data.

  • https://www.imf.org/en/News/Articles/2019/03/20/NA032119-Ecuador-New-Economic-Plan-Explained

No description has been provided for this image

As shown below, there are some missing data points in the daily oil price dataset. Various imputation methods could be used to handle these gaps, but I opted for a straightforward approach. Linear interpolation is well-suited for this time series data. By examining a time series plot of oil prices, you can observe the trend and estimate the missing data points.

In [11]:
# Importing Oil Dataset
oil = pd.read_csv(r'C:\Users\Deni Rifqi\Downloads\oil.csv')
oil["date"] = pd.to_datetime(oil.date)

# Resample
oil = oil.set_index("date").dcoilwtico.resample("D").sum().reset_index()

# Interpolate
oil["dcoilwtico"] = np.where(oil["dcoilwtico"] == 0, np.nan, oil["dcoilwtico"])
oil["dcoilwtico_interpolated"] = oil.dcoilwtico.interpolate()

# Melt Data
p = oil.melt(id_vars=['date']+list(oil.keys()[5:]), var_name='Legend')

# Plot
fig = px.line(
    p.sort_values(["Legend", "date"], ascending=[False, True]), 
    x='date', 
    y='value', 
    color='Legend',
    title="Daily Oil Price",
    width=1000,  # Mengatur lebar plot
    height=600   # Mengatur panjang plot
)

# Menampilkan plot
fig.show()

Is it accurate to say that Ecuador relies heavily on oil? Can this dependency be observed from the data?

To address this, let's first examine the correlations between sales and transactions. Although the correlation values are not very strong, there is a negative sign for sales. This might provide some insight. Intuitively, if the daily oil price is high, it suggests that Ecuador's economy might be struggling, leading to increased product prices and decreased sales. This indicates a negative relationship.

In [12]:
temp = pd.merge(temp, oil, how = "left")
print("Correlation with Daily Oil Prices")
print(temp.drop(["store_nbr", "dcoilwtico"], axis = 1).corr("spearman").dcoilwtico_interpolated.loc[["sales", "transactions"]], "\n")


fig, axes = plt.subplots(1, 2, figsize = (15,5))
temp.plot.scatter(x = "dcoilwtico_interpolated", y = "transactions", ax=axes[0], color = 'y')
temp.plot.scatter(x = "dcoilwtico_interpolated", y = "sales", ax=axes[1], color = "g")
axes[0].set_title('Daily oil price & Transactions', fontsize = 15)
axes[1].set_title('Daily Oil Price & Sales', fontsize = 15);
Correlation with Daily Oil Prices
sales          -0.30
transactions    0.04
Name: dcoilwtico_interpolated, dtype: float64 

No description has been provided for this image

We should never base your decisions solely on a single graph or result! It's crucial to reassess your perspective and formulate new hypotheses.

If we had relied on simple outputs, like those above, and concluded that there was no relationship with oil prices, we might have mistakenly decided not to use oil price data.

Now that we understand the importance of thorough analysis, let’s create a scatter plot, but this time we should focus on different product families. Most plots reveal a similar pattern: when the daily oil price is below approximately 70, sales tend to be higher. We can observe two distinct clusters here: one above 70 and one below. This distinction is quite clsight.

5. Sales¶

The primary goal is to predict store sales for each product family. To achieve this, the sales column must be analyzed in depth. We need to understand various aspects, including seasonality, trends, anomalies, and any correlations with other time series data.

When we analyze the stores using a correlation matrix, we find that most of them are quite similar. However, stores such as 20, 21, 22, and 52 appear to be somewhat different.

In [13]:
# Data preparation
a = train[["store_nbr", "sales"]]
a["ind"] = 1
a["ind"] = a.groupby("store_nbr").ind.cumsum().values
a = pd.pivot(a, index="ind", columns="store_nbr", values="sales").corr()
mask = np.triu(a.corr())

# Plot
plt.figure(figsize=(20, 20))
ax = sns.heatmap(
    a,
    annot=True,
    fmt='.1f',
    cmap='coolwarm',  # Mengganti palet warna
    square=True,
    mask=mask,
    linewidths=1,
    cbar=True  # Menambahkan colorbar
)

# Menambahkan title pada colorbar dan mengatur ukuran
colorbar = ax.collections[0].colorbar
colorbar.set_label('Correlation Value', fontsize=15)
colorbar.ax.tick_params(labelsize=10)  # Mengatur ukuran font angka pada colorbar

# Mengatur ukuran colorbar secara manual
colorbar.ax.set_aspect(20)  # Mengatur rasio aspek colorbar (Lebar/Tinggi)
plt.subplots_adjust(right=0.85)  # Menyesuaikan plot untuk memberi ruang pada colorbar

plt.title("Correlations among Stores", fontsize=20)
plt.show()
No description has been provided for this image

And there is a graph that shows us daily total sales below.

In [14]:
# Data preparation
a = train.set_index("date").groupby("store_nbr").resample("D").sales.sum().reset_index()

# Plot
fig = px.line(
    a, 
    x="date", 
    y="sales", 
    color="store_nbr", 
    title="Daily Total Sales of the Stores",
    width=1000,   # Mengatur lebar plot (dalam piksel)
    height=600    # Mengatur panjang plot (dalam piksel)
)

# Tampilkan plot
fig.show()

Some rows in the store time series data are unnecessary. By examining the data for each store individually, it becomes clear that certain stores, specifically stores 20, 21, 22, 29, 36, 42, 52, and 53, show no sales at the beginning of 2013. To address this, we will remove the rows for these stores before their opening dates. The following code will handle the removal of these rows.

In [15]:
print(train.shape)
train = train[~((train.store_nbr == 52) & (train.date < "2017-04-20"))]
train = train[~((train.store_nbr == 22) & (train.date < "2015-10-09"))]
train = train[~((train.store_nbr == 42) & (train.date < "2015-08-21"))]
train = train[~((train.store_nbr == 21) & (train.date < "2015-07-24"))]
train = train[~((train.store_nbr == 29) & (train.date < "2015-03-20"))]
train = train[~((train.store_nbr == 20) & (train.date < "2015-02-13"))]
train = train[~((train.store_nbr == 53) & (train.date < "2014-05-29"))]
train = train[~((train.store_nbr == 36) & (train.date < "2013-05-09"))]
train.shape
(3000888, 6)
Out[15]:
(2780316, 6)
In [16]:
# To know features in train and test data
print(train.columns)
print(test.columns)
Index(['id', 'date', 'store_nbr', 'family', 'sales', 'onpromotion'], dtype='object')
Index(['id', 'date', 'store_nbr', 'family', 'onpromotion'], dtype='object')
In [17]:
# Add 'sales' feature in test_data
test['sales'] = 0
In [18]:
# Modelling
# Predict 'sales' feature in test_data according to each stores and sum of transactions

from sklearn.linear_model import LinearRegression

# Misalkan kita hanya menggunakan kolom 'onpromotion' dan 'store_nbr' dari train untuk prediksi
X_train = train[['onpromotion', 'store_nbr']]
y_train = train['sales']
X_test = test[['onpromotion', 'store_nbr']]

# Membuat dan melatih model
model = LinearRegression()
model.fit(X_train, y_train)

# Memprediksi nilai 'sales' untuk DataFrame test
test['sales'] = model.predict(X_test)

Predicted 'Sales' Value in test_data¶

In [19]:
print(test[['date', 'onpromotion', 'store_nbr', 'sales']].head(100))
         date  onpromotion  store_nbr   sales
0  2017-08-16            0          1  185.84
1  2017-08-16            0          1  185.84
2  2017-08-16            2          1  262.17
3  2017-08-16           20          1  949.14
4  2017-08-16            0          1  185.84
..        ...          ...        ...     ...
95 2017-08-16            0         11  221.78
96 2017-08-16          163         11 6442.74
97 2017-08-16           14         11  756.09
98 2017-08-16            0         11  221.78
99 2017-08-16            0         12  225.37

[100 rows x 4 columns]

6. Holidays and Events¶

What a mess! Probably, you are confused due to the holidays and events data. It contains a lot of information inside but, don't worry. You just need to take a breathe and think! It is a meta-data so you have to split it logically and make the data useful.

What are our problems?

  • Some national holidays have been transferred.
  • There might be a few holidays in one day. When we merged all of data, number of rows might increase. We don't want duplicates.
  • What is the scope of holidays? It can be regional or national or local. You need to split them by the scope.
  • Work day issue
  • Some specific events
  • Creating new features etc.

End of the section, they won't be a problem anymore!

In [20]:
# Read the holidays dataset
holidays = pd.read_csv(r'C:\Users\Deni Rifqi\Downloads\holidays_events.csv')
holidays["date"] = pd.to_datetime(holidays.date)

# Separate transferred holidays and transfer type holidays
tr1 = holidays[(holidays.type == "Holiday") & (holidays.transferred == True)].drop("transferred", axis=1).reset_index(drop=True)
tr2 = holidays[holidays.type == "Transfer"].drop("transferred", axis=1).reset_index(drop=True)
tr = pd.concat([tr1, tr2], axis=1)
tr = tr.iloc[:, [5, 1, 2, 3, 4]]

# Filter out non-transferred holidays and non-transfer type holidays
holidays = holidays[(holidays.transferred == False) & (holidays.type != "Transfer")].drop("transferred", axis=1)

# Concatenate the dataframes and reset index
holidays = pd.concat([holidays, tr], ignore_index=True)

# Display the resulting DataFrame
print(holidays.head())
        date     type    locale locale_name                    description
0 2012-03-02  Holiday     Local       Manta             Fundacion de Manta
1 2012-04-01  Holiday  Regional    Cotopaxi  Provincializacion de Cotopaxi
2 2012-04-12  Holiday     Local      Cuenca            Fundacion de Cuenca
3 2012-04-14  Holiday     Local    Libertad      Cantonizacion de Libertad
4 2012-04-21  Holiday     Local    Riobamba      Cantonizacion de Riobamba
In [27]:
import gc

# Read and preprocess holidays dataset
holidays = pd.read_csv(r'C:\Users\Deni Rifqi\Downloads\holidays_events.csv')
holidays["date"] = pd.to_datetime(holidays.date)

# Separate transferred holidays and transfer type holidays
tr1 = holidays[(holidays.type == "Holiday") & (holidays.transferred == True)].drop("transferred", axis=1).reset_index(drop=True)
tr2 = holidays[holidays.type == "Transfer"].drop("transferred", axis=1).reset_index(drop=True)
tr = pd.concat([tr1, tr2], axis=1)
tr = tr.iloc[:, [5, 1, 2, 3, 4]]

# Filter out non-transferred holidays and non-transfer type holidays
holidays = holidays[(holidays.transferred == False) & (holidays.type != "Transfer")].drop("transferred", axis=1)

# Concatenate the DataFrames using pd.concat() instead of append
holidays = pd.concat([holidays, tr], ignore_index=True)

# Additional Holidays
holidays["description"] = holidays["description"].str.replace("-", "").str.replace("+", "").str.replace('\d+', '')
holidays["type"] = np.where(holidays["type"] == "Additional", "Holiday", holidays["type"])

# Bridge Holidays
holidays["description"] = holidays["description"].str.replace("Puente ", "")
holidays["type"] = np.where(holidays["type"] == "Bridge", "Holiday", holidays["type"])

# Work Day Holidays, meant to payback the Bridge.
work_day = holidays[holidays.type == "Work Day"]  
holidays = holidays[holidays.type != "Work Day"]

# Split the holidays data into different categories
events = holidays[holidays.type == "Event"].drop(["type", "locale", "locale_name"], axis=1).rename({"description": "events"}, axis=1)
holidays = holidays[holidays.type != "Event"].drop("type", axis=1)
regional = holidays[holidays.locale == "Regional"].rename({"locale_name": "state", "description": "holiday_regional"}, axis=1).drop("locale", axis=1).drop_duplicates()
national = holidays[holidays.locale == "National"].rename({"description": "holiday_national"}, axis=1).drop(["locale", "locale_name"], axis=1).drop_duplicates()
local = holidays[holidays.locale == "Local"].rename({"description": "holiday_local", "locale_name": "city"}, axis=1).drop("locale", axis=1).drop_duplicates()

# Merge with main data
d = pd.merge(pd.concat([train, test]), stores)
d["store_nbr"] = d["store_nbr"].astype("int8")

# Merge national holidays & events
d = pd.merge(d, national, how="left")
d = pd.merge(d, regional, how="left", on=["date", "state"])
d = pd.merge(d, local, how="left", on=["date", "city"])

# Merge work day information
d = pd.merge(d, work_day[["date", "type"]].rename({"type": "IsWorkDay"}, axis=1), how="left")

# Process events
events["events"] = np.where(events.events.str.contains("futbol"), "Futbol", events.events)

def one_hot_encoder(df, nan_as_category=True):
    original_columns = list(df.columns)
    categorical_columns = df.select_dtypes(["category", "object"]).columns.tolist()
    df = pd.get_dummies(df, columns=categorical_columns, dummy_na=nan_as_category)
    new_columns = [c for c in df.columns if c not in original_columns]
    df.columns = df.columns.str.replace(" ", "_")
    return df, df.columns.tolist()

events, events_cat = one_hot_encoder(events, nan_as_category=False)
events["events_Dia_de_la_Madre"] = np.where(events.date == "2016-05-08", 1, events["events_Dia_de_la_Madre"])
events = events.drop(239)

d = pd.merge(d, events, how="left")
d[events_cat] = d[events_cat].fillna(0)

# Create new features
d["holiday_national_binary"] = np.where(d.holiday_national.notnull(), 1, 0)
d["holiday_local_binary"] = np.where(d.holiday_local.notnull(), 1, 0)
d["holiday_regional_binary"] = np.where(d.holiday_regional.notnull(), 1, 0)
d["national_independence"] = np.where(d.holiday_national.isin(['Batalla de Pichincha', 'Independencia de Cuenca', 'Independencia de Guayaquil', 'Independencia de Guayaquil', 'Primer Grito de Independencia']), 1, 0)
d["local_cantonizacio"] = np.where(d.holiday_local.str.contains("Cantonizacio"), 1, 0)
d["local_fundacion"] = np.where(d.holiday_local.str.contains("Fundacion"), 1, 0)
d["local_independencia"] = np.where(d.holiday_local.str.contains("Independencia"), 1, 0)

holidays, holidays_cat = one_hot_encoder(d[["holiday_national", "holiday_regional", "holiday_local"]], nan_as_category=False)
d = pd.concat([d.drop(["holiday_national", "holiday_regional", "holiday_local"], axis=1), holidays], axis=1)

he_cols = d.columns[d.columns.str.startswith("events")].tolist() + d.columns[d.columns.str.startswith("holiday")].tolist() + d.columns[d.columns.str.startswith("national")].tolist() + d.columns[d.columns.str.startswith("local")].tolist()
d[he_cols] = d[he_cols].astype("int8")

d[["family", "city", "state", "type"]] = d[["family", "city", "state", "type"]].astype("category")

# Clean up
del holidays, holidays_cat, work_day, local, regional, national, events, events_cat, tr, tr1, tr2, he_cols
gc.collect()

d.head(10)
Out[27]:
id date store_nbr family sales onpromotion city state type cluster IsWorkDay events_Black_Friday events_Cyber_Monday events_Dia_de_la_Madre events_Futbol events_Terremoto_Manabi events_Terremoto_Manabi1 events_Terremoto_Manabi10 events_Terremoto_Manabi11 events_Terremoto_Manabi12 events_Terremoto_Manabi13 events_Terremoto_Manabi14 events_Terremoto_Manabi15 events_Terremoto_Manabi16 events_Terremoto_Manabi17 events_Terremoto_Manabi18 events_Terremoto_Manabi19 events_Terremoto_Manabi2 events_Terremoto_Manabi20 events_Terremoto_Manabi21 events_Terremoto_Manabi22 events_Terremoto_Manabi23 events_Terremoto_Manabi24 events_Terremoto_Manabi25 events_Terremoto_Manabi26 events_Terremoto_Manabi27 events_Terremoto_Manabi28 events_Terremoto_Manabi29 events_Terremoto_Manabi3 events_Terremoto_Manabi30 events_Terremoto_Manabi4 events_Terremoto_Manabi5 events_Terremoto_Manabi6 events_Terremoto_Manabi7 events_Terremoto_Manabi8 events_Terremoto_Manabi9 holiday_national_binary holiday_local_binary holiday_regional_binary national_independence local_cantonizacio local_fundacion local_independencia holiday_national_Batalla_de_Pichincha holiday_national_Carnaval holiday_national_Dia_de_Difuntos holiday_national_Dia_de_la_Madre1 holiday_national_Dia_del_Trabajo holiday_national_Independencia_de_Cuenca holiday_national_Independencia_de_Guayaquil holiday_national_Navidad holiday_national_Navidad1 holiday_national_Navidad2 holiday_national_Navidad3 holiday_national_Navidad4 holiday_national_Primer_Grito_de_Independencia holiday_national_Primer_dia_del_ano holiday_national_Primer_dia_del_ano1 holiday_national_Viernes_Santo holiday_regional_Provincializacion_Santa_Elena holiday_regional_Provincializacion_de_Cotopaxi holiday_regional_Provincializacion_de_Imbabura holiday_regional_Provincializacion_de_Santo_Domingo holiday_local_Cantonizacion_de_Cayambe holiday_local_Cantonizacion_de_El_Carmen holiday_local_Cantonizacion_de_Guaranda holiday_local_Cantonizacion_de_Latacunga holiday_local_Cantonizacion_de_Libertad holiday_local_Cantonizacion_de_Quevedo holiday_local_Cantonizacion_de_Riobamba holiday_local_Cantonizacion_de_Salinas holiday_local_Cantonizacion_del_Puyo holiday_local_Fundacion_de_Ambato holiday_local_Fundacion_de_Cuenca holiday_local_Fundacion_de_Esmeraldas holiday_local_Fundacion_de_Guayaquil holiday_local_Fundacion_de_Guayaquil1 holiday_local_Fundacion_de_Ibarra holiday_local_Fundacion_de_Loja holiday_local_Fundacion_de_Machala holiday_local_Fundacion_de_Manta holiday_local_Fundacion_de_Quito holiday_local_Fundacion_de_Quito1 holiday_local_Fundacion_de_Riobamba holiday_local_Fundacion_de_Santo_Domingo holiday_local_Independencia_de_Ambato holiday_local_Independencia_de_Guaranda holiday_local_Independencia_de_Latacunga
0 0 2013-01-01 1 AUTOMOTIVE 0.00 0.00 Quito Pichincha D 13 NaN 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 2013-01-01 1 BABY CARE 0.00 0.00 Quito Pichincha D 13 NaN 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 2 2013-01-01 1 BEAUTY 0.00 0.00 Quito Pichincha D 13 NaN 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 3 2013-01-01 1 BEVERAGES 0.00 0.00 Quito Pichincha D 13 NaN 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 4 2013-01-01 1 BOOKS 0.00 0.00 Quito Pichincha D 13 NaN 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 5 2013-01-01 1 BREAD/BAKERY 0.00 0.00 Quito Pichincha D 13 NaN 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 6 2013-01-01 1 CELEBRATION 0.00 0.00 Quito Pichincha D 13 NaN 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 7 2013-01-01 1 CLEANING 0.00 0.00 Quito Pichincha D 13 NaN 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 8 2013-01-01 1 DAIRY 0.00 0.00 Quito Pichincha D 13 NaN 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9 9 2013-01-01 1 DELI 0.00 0.00 Quito Pichincha D 13 NaN 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Let's perform an AB test on the Events and Holidays features to determine their statistical significance. This could also serve as an initial method for feature selection.

  • Null Hypothesis (H0): Sales are equal (M1 = M2)
  • Alternative Hypothesis (H1): Sales are not equal (M1 ≠ M2)
In [28]:
def AB_Test(dataframe, group, target):
    
    # Packages
    from scipy.stats import shapiro
    import scipy.stats as stats
    
    # Split A/B
    groupA = dataframe[dataframe[group] == 1][target]
    groupB = dataframe[dataframe[group] == 0][target]

    # Assumption: Normality
    ntA = shapiro(groupA)[1] < 0.05
    ntB = shapiro(groupB)[1] < 0.05
    # H0: Distribution is Normal! - False
    # H1: Distribution is not Normal! - True

    if (ntA == False) & (ntB == False): # "H0: Normal Distribution"
        # Parametric Test
        # Assumption: Homogeneity of variances
        leveneTest = stats.levene(groupA, groupB)[1] < 0.05
        # H0: Homogeneity: False
        # H1: Heterogeneous: True
        
        if leveneTest == False:
            # Homogeneity
            ttest = stats.ttest_ind(groupA, groupB, equal_var=True)[1]
            # H0: M1 == M2 - False
            # H1: M1 != M2 - True
        else:
            # Heterogeneous
            ttest = stats.ttest_ind(groupA, groupB, equal_var=False)[1]
            # H0: M1 == M2 - False
            # H1: M1 != M2 - True
    else:
        # Non-Parametric Test
        ttest = stats.mannwhitneyu(groupA, groupB)[1] 
        # H0: M1 == M2 - False
        # H1: M1 != M2 - True

    # Result
    temp = pd.DataFrame({
        "AB Hypothesis":[ttest < 0.05], 
        "p-value":[ttest]
    })
    temp["Test Type"] = np.where((ntA == False) & (ntB == False), "Parametric", "Non-Parametric")
    temp["AB Hypothesis"] = np.where(temp["AB Hypothesis"] == False, "Fail to Reject H0", "Reject H0")
    temp["Comment"] = np.where(temp["AB Hypothesis"] == "Fail to Reject H0", "A/B groups are similar!", "A/B groups are not similar!")
    temp["Feature"] = group
    temp["GroupA_mean"] = groupA.mean()
    temp["GroupB_mean"] = groupB.mean()
    temp["GroupA_median"] = groupA.median()
    temp["GroupB_median"] = groupB.median()
    
    # Columns
    if (ntA == False) & (ntB == False):
        temp["Homogeneity"] = np.where(leveneTest == False, "Yes", "No")
        temp = temp[["Feature","Test Type", "Homogeneity","AB Hypothesis", "p-value", "Comment", "GroupA_mean", "GroupB_mean", "GroupA_median", "GroupB_median"]]
    else:
        temp = temp[["Feature","Test Type","AB Hypothesis", "p-value", "Comment", "GroupA_mean", "GroupB_mean", "GroupA_median", "GroupB_median"]]
    
    # Print Hypothesis
    # print("# A/B Testing Hypothesis")
    # print("H0: A == B")
    # print("H1: A != B", "\n")
    
    return temp
        
    
In [29]:
# Apply A/B Testing
he_cols = d.columns[d.columns.str.startswith("events")].tolist() + d.columns[d.columns.str.startswith("holiday")].tolist() + d.columns[d.columns.str.startswith("national")].tolist()+ d.columns[d.columns.str.startswith("local")].tolist()
ab = []
for i in he_cols:
    ab.append(AB_Test(dataframe=d[d.sales.notnull()], group = i, target = "sales"))
ab = pd.concat(ab)
ab
Out[29]:
Feature Test Type AB Hypothesis p-value Comment GroupA_mean GroupB_mean GroupA_median GroupB_median
0 events_Black_Friday Non-Parametric Reject H0 0.00 A/B groups are not similar! 382.24 387.89 23.00 17.00
0 events_Cyber_Monday Non-Parametric Reject H0 0.00 A/B groups are not similar! 458.88 387.75 21.00 17.00
0 events_Dia_de_la_Madre Non-Parametric Reject H0 0.00 A/B groups are not similar! 374.92 387.91 14.00 17.00
0 events_Futbol Non-Parametric Reject H0 0.00 A/B groups are not similar! 349.84 388.18 11.00 17.00
0 events_Terremoto_Manabi Non-Parametric Reject H0 0.00 A/B groups are not similar! 492.92 387.81 34.00 17.00
... ... ... ... ... ... ... ... ... ...
0 holiday_local_Independencia_de_Latacunga Non-Parametric Reject H0 0.00 A/B groups are not similar! 170.47 387.90 8.50 17.00
0 national_independence Non-Parametric Reject H0 0.00 A/B groups are not similar! 443.59 387.28 20.00 17.00
0 local_cantonizacio Non-Parametric Reject H0 0.00 A/B groups are not similar! 387.52 488.00 17.00 23.00
0 local_fundacion Non-Parametric Reject H0 0.00 A/B groups are not similar! 387.99 239.87 17.00 11.00
0 local_independencia Non-Parametric Reject H0 0.00 A/B groups are not similar! 387.57 467.87 17.00 22.00

87 rows × 9 columns

In [30]:
# Check if 'd' is defined
if 'd' in locals():
    print("DataFrame 'd' is defined and ready for use.")
else:
    print("DataFrame 'd' is not defined. Please define it before running the A/B tests.")
DataFrame 'd' is defined and ready for use.
In [35]:
# Average sales for different product categories (family) during periods with and without a "Black Friday" event (events_Black_Friday).

d.groupby(["family","events_Black_Friday"]).sales.mean()[:20]
Out[35]:
family        events_Black_Friday
AUTOMOTIVE    0                        9.39
              1                        5.89
BABY CARE     0                        2.98
              1                        0.08
BEAUTY        0                        7.14
              1                        3.52
BEVERAGES     0                     2562.21
              1                     2766.75
BOOKS         0                        2.93
              1                        0.23
BREAD/BAKERY  0                      502.55
              1                      422.58
CELEBRATION   0                       11.80
              1                       15.84
CLEANING      0                     1155.80
              1                      982.76
DAIRY         0                      769.87
              1                      684.00
DELI          0                      291.73
              1                      265.42
Name: sales, dtype: float64

e significantly to 15.84.

Summary:¶

The results indicate how average sales for various product categories change during the Black Friday period compared to normal periods. For some categories like "BEVERAGES" and "CELEBRATION," sales increase during Black Friday, possibly due to promotions or increased consumer spending during this time. However, other categories like "AUTOMOTIVE" and "CLEANING" see a decrease in average sales during Black Friday, which could suggest that customers prioritize different products during this event.

7. Simple Moving Average¶

In [36]:
a = train.sort_values(["store_nbr", "family", "date"])
for i in [20, 30, 45, 60, 90, 120, 365, 730]:
    a["SMA"+str(i)+"_sales_lag16"] = a.groupby(["store_nbr", "family"]).rolling(i).sales.mean().shift(16).values
    a["SMA"+str(i)+"_sales_lag30"] = a.groupby(["store_nbr", "family"]).rolling(i).sales.mean().shift(30).values
    a["SMA"+str(i)+"_sales_lag60"] = a.groupby(["store_nbr", "family"]).rolling(i).sales.mean().shift(60).values
print("Correlation")
a[["sales"]+a.columns[a.columns.str.startswith("SMA")].tolist()].corr()
Correlation
Out[36]:
sales SMA20_sales_lag16 SMA20_sales_lag30 SMA20_sales_lag60 SMA30_sales_lag16 SMA30_sales_lag30 SMA30_sales_lag60 SMA45_sales_lag16 SMA45_sales_lag30 SMA45_sales_lag60 SMA60_sales_lag16 SMA60_sales_lag30 SMA60_sales_lag60 SMA90_sales_lag16 SMA90_sales_lag30 SMA90_sales_lag60 SMA120_sales_lag16 SMA120_sales_lag30 SMA120_sales_lag60 SMA365_sales_lag16 SMA365_sales_lag30 SMA365_sales_lag60 SMA730_sales_lag16 SMA730_sales_lag30 SMA730_sales_lag60
sales 1.00 0.91 0.89 0.87 0.91 0.89 0.87 0.91 0.90 0.87 0.91 0.89 0.87 0.91 0.89 0.87 0.90 0.89 0.86 0.90 0.89 0.87 0.91 0.89 0.86
SMA20_sales_lag16 0.91 1.00 0.99 0.95 1.00 0.98 0.95 0.99 0.98 0.95 0.99 0.98 0.95 0.99 0.98 0.95 0.98 0.97 0.95 0.97 0.97 0.95 0.98 0.98 0.95
SMA20_sales_lag30 0.89 0.99 1.00 0.96 0.99 1.00 0.96 1.00 0.99 0.97 0.99 0.99 0.97 0.99 0.99 0.96 0.99 0.98 0.96 0.97 0.97 0.96 0.98 0.98 0.96
SMA20_sales_lag60 0.87 0.95 0.96 1.00 0.96 0.98 1.00 0.98 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.98 0.98 0.97 0.97 0.98 0.98 0.98
SMA30_sales_lag16 0.91 1.00 0.99 0.96 1.00 0.99 0.96 1.00 0.99 0.96 0.99 0.99 0.96 0.99 0.98 0.96 0.99 0.98 0.96 0.98 0.97 0.96 0.98 0.98 0.96
SMA30_sales_lag30 0.89 0.98 1.00 0.98 0.99 1.00 0.98 1.00 1.00 0.98 1.00 0.99 0.98 0.99 0.99 0.97 0.99 0.99 0.97 0.98 0.98 0.97 0.98 0.98 0.98
SMA30_sales_lag60 0.87 0.95 0.96 1.00 0.96 0.98 1.00 0.98 0.99 1.00 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.98 0.98 0.98 0.98 0.98 0.98
SMA45_sales_lag16 0.91 0.99 1.00 0.98 1.00 1.00 0.98 1.00 1.00 0.98 1.00 0.99 0.98 0.99 0.99 0.98 0.99 0.99 0.97 0.98 0.98 0.98 0.98 0.98 0.98
SMA45_sales_lag30 0.90 0.98 0.99 0.99 0.99 1.00 0.99 1.00 1.00 0.99 1.00 1.00 0.99 1.00 0.99 0.98 0.99 0.99 0.98 0.98 0.98 0.98 0.98 0.98 0.98
SMA45_sales_lag60 0.87 0.95 0.97 0.99 0.96 0.98 1.00 0.98 0.99 1.00 0.99 0.99 1.00 0.99 1.00 0.99 0.99 1.00 0.99 0.98 0.98 0.98 0.98 0.98 0.98
SMA60_sales_lag16 0.91 0.99 0.99 0.99 0.99 1.00 0.99 1.00 1.00 0.99 1.00 1.00 0.99 1.00 0.99 0.98 0.99 0.99 0.98 0.98 0.98 0.98 0.98 0.98 0.98
SMA60_sales_lag30 0.89 0.98 0.99 0.99 0.99 0.99 0.99 0.99 1.00 0.99 1.00 1.00 0.99 1.00 1.00 0.99 1.00 0.99 0.98 0.98 0.98 0.98 0.98 0.98 0.98
SMA60_sales_lag60 0.87 0.95 0.97 0.99 0.96 0.98 0.99 0.98 0.99 1.00 0.99 0.99 1.00 0.99 1.00 1.00 1.00 1.00 0.99 0.99 0.99 0.98 0.98 0.98 0.98
SMA90_sales_lag16 0.91 0.99 0.99 0.99 0.99 0.99 0.99 0.99 1.00 0.99 1.00 1.00 0.99 1.00 1.00 0.99 1.00 1.00 0.99 0.99 0.99 0.98 0.99 0.98 0.98
SMA90_sales_lag30 0.89 0.98 0.99 0.99 0.98 0.99 0.99 0.99 0.99 1.00 0.99 1.00 1.00 1.00 1.00 0.99 1.00 1.00 0.99 0.99 0.99 0.98 0.98 0.99 0.98
SMA90_sales_lag60 0.87 0.95 0.96 0.99 0.96 0.97 0.99 0.98 0.98 0.99 0.98 0.99 1.00 0.99 0.99 1.00 1.00 1.00 1.00 0.99 0.99 0.99 0.98 0.98 0.99
SMA120_sales_lag16 0.90 0.98 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.99 0.99 0.99 0.99 0.99 0.99 0.99
SMA120_sales_lag30 0.89 0.97 0.98 0.99 0.98 0.99 0.99 0.99 0.99 1.00 0.99 0.99 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.99 0.99 0.99 0.99 0.99 0.99
SMA120_sales_lag60 0.86 0.95 0.96 0.98 0.96 0.97 0.99 0.97 0.98 0.99 0.98 0.98 0.99 0.99 0.99 1.00 0.99 1.00 1.00 0.99 0.99 0.99 0.99 0.99 0.99
SMA365_sales_lag16 0.90 0.97 0.97 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.99 0.99 0.99 0.99 0.99 0.99 0.99 1.00 1.00 1.00 1.00 1.00 1.00
SMA365_sales_lag30 0.89 0.97 0.97 0.97 0.97 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.99 0.99 0.99 0.99 0.99 0.99 0.99 1.00 1.00 1.00 1.00 1.00 1.00
SMA365_sales_lag60 0.87 0.95 0.96 0.97 0.96 0.97 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.99 0.99 0.99 0.99 1.00 1.00 1.00 1.00 1.00 1.00
SMA730_sales_lag16 0.91 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.99 0.98 0.98 0.99 0.99 0.99 1.00 1.00 1.00 1.00 1.00 1.00
SMA730_sales_lag30 0.89 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.99 0.98 0.99 0.99 0.99 1.00 1.00 1.00 1.00 1.00 1.00
SMA730_sales_lag60 0.86 0.95 0.96 0.98 0.96 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.99 0.99 0.99 0.99 1.00 1.00 1.00 1.00 1.00 1.00

Summary of the Correlation Results:¶

The correlation matrix displays the relationship between the original sales and various lagged Simple Moving Averages (SMA) of different window sizes. Here's a summary of the key points:

  1. Strong Correlation with Sales:

    • The original sales variable has a strong positive correlation with all the lagged SMA features, with correlations ranging from 0.86 to 0.91. This indicates that past sales data, averaged over different periods, is highly predictive of current sales.
  2. Correlation Among SMA Features:

    • The SMA features are also highly correlated with each other. For example, the correlation between SMA20_sales_lag16 and SMA20_sales_lag30 is 0.99. This high correlation suggests that different lag periods for the same window size produce similar trends in the data.
  3. Decreasing Correlation with Longer Lags:

    • As the lag increases (e.g., from 16 days to 60 days), the correlation between the SMA features and sales slightly decreases. This trend is visible across all window sizes, showing that more recent data is generally more predictive.
  4. Window Size Influence:

    • The correlation between sales and SMA features slightly decreases as the window size increases (e.g., from 20 to 730 days). However, the correlation remains strong across all window sizes, indicating that both short-term and long-term trends are relevant to predicting sales.
  5. Overall Patterns:

    • The matrix suggests that any of these SMA features could be useful predictors in a forecasting model. However, the selection of specific lags and window sizes may depend on the desired balance between capturing short-term versus long-term trends in the data.

Implications for Modeling:¶

  • The strong correlations indicate that the SMA features will likely be valuable in a sales forecasting model. However, the high multicollinearity among these features may require dimensionality reduction techniques, such as PCA, or the careful selection of non-redundant features to avoid overfitting in predictive models.
In [38]:
b = a[(a.store_nbr == 5)].set_index("date")
for i in b.family.unique():
    fig, ax = plt.subplots(2,4,figsize=(20,10))
    b[b.family == i][["sales", "SMA20_sales_lag16"]].plot(legend = True, ax = ax[0,0], linewidth = 4)
    b[b.family == i][["sales", "SMA30_sales_lag16"]].plot(legend = True, ax = ax[0,1], linewidth = 4)
    b[b.family == i][["sales", "SMA45_sales_lag16"]].plot(legend = True, ax = ax[0,2], linewidth = 4)
    b[b.family == i][["sales", "SMA60_sales_lag16"]].plot(legend = True, ax = ax[0,3], linewidth = 4)
    b[b.family == i][["sales", "SMA90_sales_lag16"]].plot(legend = True, ax = ax[1,0], linewidth = 4)
    b[b.family == i][["sales", "SMA120_sales_lag16"]].plot(legend = True, ax = ax[1,1], linewidth = 4)
    b[b.family == i][["sales", "SMA365_sales_lag16"]].plot(legend = True, ax = ax[1,2], linewidth = 4)
    b[b.family == i][["sales", "SMA730_sales_lag16"]].plot(legend = True, ax = ax[1,3], linewidth = 4)
    plt.suptitle("STORE 1 - "+i, fontsize = 15)
    plt.tight_layout(pad = 1.5)
    for j in range(0,4):
        ax[0,j].legend(fontsize="x-large")
        ax[1,j].legend(fontsize="x-large")
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image